1 Introduction
This HTML notebook will replicate the STATA do file named vnm_mics14_dp2019 for distribution in R. The goal is to create an R script that does what the said do file does.
I will follow the same section numbering as in the do file for ease of comparision.
2 Libraries
Before we go into the excercise, following are the packages from which we will use various functions.
## here() starts at D:/R projects/OPHI/Vietnam Translation/Vietnam-MICS-MPI-to-R
## -- Attaching packages --------
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts -----------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Warning: package 'janitor' was built under R version 3.6.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## Warning: package 'gt' was built under R version 3.6.3
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
3 Building a Well-Being Index from Vietnam MICS 2013-14
The memory and environment clearing commands and the commands for setting working folders and paths are not needed if one is working in Project mode through RStudio and uses the {here} (read as package here). This is a library for managing paths and directories.
4 Vietnam MICS 2014
4.1 Step 1: Data Preparation
__ Selecting main variables from CH, WM, HH & MN recode & merging with HL recode __
It should be noted that anthropometric data was not collected for children under 5 as part of the Viet Nam MICS 2014 dataset. Previously, nutrition data was collected as part of Viet Nam MICS 2011. However, the data was not collected in this round due to time and resource constraints as well as the availability of national nutrition survey data (p.61)
Above comments are copied from the STATA do file
4.1.1 Step 1.1: CH - Children’s Recode (Under 5)
According to the STATA do file there is no data for this section.
4.1.2 Step 1.2: BH - Birth Recode (All females 15-49 years who ever gave birth)
The purpose of step 1.2 is to identify children of any age who died in the last 5 years prior to the survey date. As seen in the STATA file
Loading the data from bh.sav.
bh_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"bh.sav"))
bh_dat <- clean_names(bh_dat)The above code chunk loads the bh.sav and names it bh_dat(a data object). The clean names function gets all the variable names in lower snake case. IN case if anyone is wondering what are all the possible cases, please refere to the wonderful art by Allison Horst shown below.
Various Cases (Let me know your favourite)
Now let us take a glimplse at the data and names of the variables.
## [1] "hh1" "hh2" "ln" "bhln" "bh2" "bh3"
## [7] "bh4m" "bh4y" "bh5" "bh6" "bh7" "bh8"
## [13] "bh9u" "bh9n" "bh10" "bh4c" "bh4f" "bh9c"
## [19] "bh9f" "hh6" "hh7" "wdoi" "wdob" "ethnicity"
## [25] "welevel" "brthord" "magebrt" "birthint" "wmweight" "wscore"
## [31] "windex5" "wscoreu" "windex5u" "wscorer" "windex5r" "windex2"
| hh1 | hh2 | ln | bhln | bh2 | bh3 | bh4m | bh4y | bh5 | bh6 | bh7 | bh8 | bh9u | bh9n | bh10 | bh4c | bh4f | bh9c | bh9f | hh6 | hh7 | wdoi | wdob | ethnicity | welevel | brthord | magebrt | birthint | wmweight | wscore | windex5 | wscoreu | windex5u | wscorer | windex5r | windex2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 1 | 1 | 4 | 2011 | 1 | 2 | 1 | 4 | NA | NA | NA | 1336 | 1 | NA | NA | 1 | 1 | 1368 | 1044 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.640778 | 5 | 1.337577 | 5 | NA | NA | 2 |
| 1 | 2 | 2 | 2 | 1 | 1 | 5 | 2013 | 1 | 0 | 1 | 5 | NA | NA | 2 | 1361 | 1 | NA | NA | 1 | 1 | 1368 | 1044 | 1 | 4 | 2 | 2 | 2 | 2.167035 | 1.640778 | 5 | 1.337577 | 5 | NA | NA | 2 |
| 1 | 3 | 2 | 1 | 1 | 2 | 3 | 2003 | 1 | 10 | 1 | 4 | NA | NA | NA | 1239 | 1 | NA | NA | 1 | 1 | 1368 | 939 | 1 | 3 | 1 | 2 | 1 | 2.167035 | 1.427797 | 5 | 1.071892 | 5 | NA | NA | 2 |
| 1 | 3 | 2 | 2 | 1 | 1 | 6 | 2007 | 1 | 6 | 1 | 5 | NA | NA | 2 | 1290 | 1 | NA | NA | 1 | 1 | 1368 | 939 | 1 | 3 | 2 | 2 | 4 | 2.167035 | 1.427797 | 5 | 1.071892 | 5 | NA | NA | 2 |
| 1 | 4 | 2 | 1 | 1 | 2 | 11 | 2007 | 1 | 6 | 1 | 4 | NA | NA | NA | 1295 | 1 | NA | NA | 1 | 1 | 1368 | 1008 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.613568 | 5 | 1.303632 | 5 | NA | NA | 2 |
| 1 | 6 | 2 | 1 | 1 | 1 | 3 | 1995 | 1 | 18 | 1 | 3 | NA | NA | NA | 1143 | 1 | NA | NA | 1 | 1 | 1368 | 788 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.624926 | 5 | 1.317802 | 5 | NA | NA | 2 |
Now let us create variables, only those that are non grouped, i.e. those generated without using bysort. In R one can create grouped varibales in a different manner. Will show that in the code chunk following this one.
Before moving forward, please note that, a large part of coding involved in creating this index is a data wrangling process. The tidyverse way of carrying out data wrangling extensively employs {dplyr} (read as ‘package dplyr’) and its mutate function. Here is a visual cue (created by Allison Horst) to what the mutate function does.
Wrangling and adding columns
bh_dat <- bh_dat %>%
mutate(
ind_id = structure(((hh1*100000)+(hh2*100)+ln),
label = "Individual ID"),
date_death = structure(bh4c + bh9c,
label = "Date of Death CMC"),
mdead_survey = structure(ifelse(((bh9c == 0 |is.na(bh9c)) & bh5 == 1),
NA,
wdoi - date_death),
label = "Months Dead from Survey"),
ydead_survey = structure(mdead_survey/12,
label = "Years dead from Survey"),
age_death = structure(ifelse(bh5 == 2, bh9c, NA),
label = "Age at death in Months"),
child_died = structure(forcats::as_factor(
ifelse(is.na(bh5),NA,
ifelse(bh5==1,"child is alive","child is dead")))),
child18_died = forcats::as_factor(
ifelse((!is.na(age_death) & age_death > 216),
ifelse(is.na(bh5),NA,
ifelse(bh5==1,
"child is alive",
"child is dead")),
"child is alive"))
)Creating Tables for new variables check.
| age_death | n | percent | valid_percent |
|---|---|---|---|
| 0 | 242 | 1.563408e-02 | 0.455743879 |
| 1 | 33 | 2.131921e-03 | 0.062146893 |
| 2 | 17 | 1.098262e-03 | 0.032015066 |
| 3 | 17 | 1.098262e-03 | 0.032015066 |
| 4 | 9 | 5.814329e-04 | 0.016949153 |
| 5 | 5 | 3.230183e-04 | 0.009416196 |
| 6 | 14 | 9.044512e-04 | 0.026365348 |
| 7 | 6 | 3.876219e-04 | 0.011299435 |
| 8 | 10 | 6.460366e-04 | 0.018832392 |
| 9 | 2 | 1.292073e-04 | 0.003766478 |
| 11 | 5 | 3.230183e-04 | 0.009416196 |
| 12 | 23 | 1.485884e-03 | 0.043314501 |
| 14 | 5 | 3.230183e-04 | 0.009416196 |
| 18 | 3 | 1.938110e-04 | 0.005649718 |
| 19 | 1 | 6.460366e-05 | 0.001883239 |
| 20 | 2 | 1.292073e-04 | 0.003766478 |
| 23 | 4 | 2.584146e-04 | 0.007532957 |
| 24 | 22 | 1.421280e-03 | 0.041431262 |
| 36 | 27 | 1.744299e-03 | 0.050847458 |
| 48 | 14 | 9.044512e-04 | 0.026365348 |
| 60 | 6 | 3.876219e-04 | 0.011299435 |
| 72 | 9 | 5.814329e-04 | 0.016949153 |
| 84 | 3 | 1.938110e-04 | 0.005649718 |
| 96 | 5 | 3.230183e-04 | 0.009416196 |
| 108 | 6 | 3.876219e-04 | 0.011299435 |
| 120 | 1 | 6.460366e-05 | 0.001883239 |
| 132 | 1 | 6.460366e-05 | 0.001883239 |
| 144 | 3 | 1.938110e-04 | 0.005649718 |
| 156 | 1 | 6.460366e-05 | 0.001883239 |
| 168 | 3 | 1.938110e-04 | 0.005649718 |
| 180 | 1 | 6.460366e-05 | 0.001883239 |
| 192 | 7 | 4.522256e-04 | 0.013182674 |
| 204 | 3 | 1.938110e-04 | 0.005649718 |
| 216 | 7 | 4.522256e-04 | 0.013182674 |
| 228 | 4 | 2.584146e-04 | 0.007532957 |
| 240 | 3 | 1.938110e-04 | 0.005649718 |
| 252 | 1 | 6.460366e-05 | 0.001883239 |
| 264 | 2 | 1.292073e-04 | 0.003766478 |
| 276 | 3 | 1.938110e-04 | 0.005649718 |
| 312 | 1 | 6.460366e-05 | 0.001883239 |
| NA | 14948 | 9.656955e-01 | NA |
## Warning: Column `bh5` has different attributes on LHS and RHS of join
| bh5 | child is alive | child is dead |
|---|---|---|
| 1 | 14948 | 0 |
| 2 | 0 | 531 |
| child18_died | n | percent |
|---|---|---|
| child is alive | 15465 | 0.9990955488 |
| child is dead | 14 | 0.0009044512 |
Creating grouped variables.
Notice that I need not write equivalent R code for the following STATA code chunks:
1> This has been included in the grouped condition for the variable
replace tot_child18_died_5y=. if child18_died==1 & ydead_survey==. //Replace as ‘.’ if there is no information on when the child died
2> This need not be written as the structure of data is different, only one rwo exists per ind_id NOte that due to this feature there is no need to create “child18u_died_per_wom_5y” as it is same as “tot_child18_died_5y”
bysort ind_id: egen childu18_died_per_wom_5y = max(tot_child18_died_5y) lab var childu18_died_per_wom_5y “Total child under 18 death for each women in the last 5 years (birth recode)”
3> This need not be written as the structure of data is different, only one rwo exists per ind_id
//Keep one observation per women bysort ind_id: gen id=1 if _n==1 keep if id==1 drop id duplicates report ind_id
bh_dat %>%
group_by(ind_id) %>%
summarise(
tot_child_died = sum(child_died == "child is dead", na.rm = T),
tot_child18_died_5y = sum(
child18_died[ydead_survey <= 5 & !is.na(ydead_survey)] == "child is dead", na.rm = T),
women_BH = 1
) %>%
mutate(
tot_child18_died_5y = ifelse(
(is.na(tot_child18_died_5y) & tot_child_died>=0 & !is.na(tot_child_died)),0,tot_child18_died_5y)
)-> grouped_bh_datCreating variables for new grouped variables
## tot_child_died 0 1
## 0 6647 0
## 1 382 7
## 2 55 0
## 3 6 2
## 4 2 0
Writing new files.
4.1.3 Step 1.3: WM - Women’s Record (All eligible females 15-49 in the household)
Loading data and setting case for variable (column) names.
wm_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"wm.sav"))
wm_dat <- clean_names(wm_dat)Mutating the data for creating new variables. This is what one would refere to generating new varibles in STATA, I think.
Note, for the varible marital or marital_wom, I miss to see whyn the STATA code defines it as marital to begin with and renames it to marital_wom. I will define this varibale as marital_wom from the begining itself. Kindly bring to notice if this is incorrect.
wm_dat <- wm_dat %>%
mutate(
ind_id = structure(((hh1*100000)+(hh2*100)+ln),
label = "Individual ID"),
women_wm = 1,
marital_wom = structure(
forcats::as_factor(
case_when(
mstatus == 3 & is.na(ma6) ~ "Never Married",
mstatus == 1 & is.na(ma6) ~ "Currently Married",
mstatus == 2 & ma6 == 1 ~ "Widowed",
mstatus == 2 & ma6 == 2 ~ "Divorced",
mstatus == 2 & ma6 == 3 ~ "Seperated/Not living together",
)
), label = "Marital Status of Household Member"
)
) While creating new variables the STATA code has various tables. I will generate those here.
## [1] TRUE
This means all Individual Ids are unique, i.e., every row represents a unique individual.
| cm1 | 1 | 2 | NA_ |
|---|---|---|---|
| 1 | 454 | 6647 | 0 |
| 2 | 0 | 2726 | 0 |
| NA | 0 | 0 | 363 |
## # A tibble: 10 x 2
## mstatus ma6
## <dbl+lbl> <dbl+lbl>
## 1 1 [Currently married/in union] NA
## 2 1 [Currently married/in union] NA
## 3 1 [Currently married/in union] NA
## 4 3 [Never married/in union] NA
## 5 3 [Never married/in union] NA
## 6 3 [Never married/in union] NA
## 7 3 [Never married/in union] NA
## 8 1 [Currently married/in union] NA
## 9 1 [Currently married/in union] NA
## 10 3 [Never married/in union] NA
| mstatus | 1 | 2 | 3 | NA_ |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 6972 |
| 2 | 207 | 193 | 107 | 0 |
| 3 | 0 | 0 | 0 | 2348 |
| NA | 0 | 0 | 0 | 363 |
| marital_wom | n | percent | valid_percent |
|---|---|---|---|
| Currently Married | 6972 | 0.68420020 | 0.70947390 |
| Never Married | 2348 | 0.23042198 | 0.23893355 |
| Widowed | 207 | 0.02031403 | 0.02106441 |
| Divorced | 193 | 0.01894014 | 0.01963977 |
| Seperated/Not living together | 107 | 0.01050049 | 0.01088837 |
| NA | 363 | 0.03562316 | NA |
| ma6 | Currently Married | Never Married | Widowed | Divorced | Seperated/Not living together | NA_ |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 207 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 193 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 107 | 0 |
| NA | 6972 | 2348 | 0 | 0 | 0 | 363 |
| mstatus | Currently Married | Never Married | Widowed | Divorced | Seperated/Not living together | NA_ |
|---|---|---|---|---|---|---|
| 1 | 6972 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 207 | 193 | 107 | 0 |
| 3 | 0 | 2348 | 0 | 0 | 0 | 0 |
| NA | 0 | 0 | 0 | 0 | 0 | 363 |
Keeping relevant columns.
Writing the tables in csv.
4.1.4 Step 1.4: MR - MEn’s Recode (All eligible men in the household)
As mentioned in the STATA code, Note: There is no male recode file for Viet Nam MICS 2014. Hence the commands under this section have been remove
4.1.5 Step 1.5 HH - Household Record (All Households intereviewed)
Load data.
hh_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"hh.sav"))
hh_dat <- clean_names(hh_dat)Creating new variables.
write data in csv.
4.1.6 step 1.6: HL - Household Member
Load data.
hl_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"hl.sav"))
hl_dat <- clean_names(hl_dat)Creating new variables.
4.1.7 Step 1.7: Data Merging
Unlike the STATA code the merging here will look a little different. It will be in a single flow. However, checks will be shown prior to mrunnig the merge command.
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
tabyl(hl7, show_na = T)## hl7 n percent
## 0 29190 0.7412392077
## 1 1065 0.0270441849
## 2 4650 0.1180802438
## 3 2037 0.0517267649
## 4 1479 0.0375571356
## 5 541 0.0137379380
## 6 228 0.0057897410
## 7 110 0.0027932961
## 8 38 0.0009649568
## 9 27 0.0006856272
## 10 9 0.0002285424
## 11 3 0.0000761808
## 13 2 0.0000507872
## 15 1 0.0000253936
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
mutate(temp = hl7 > 0) %>%
tabyl(women_wm, temp, show_na = T)## women_wm FALSE TRUE
## 1 0 10190
## NA 29190 0
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
mutate(temp = hl7 > 0) %>%
filter((temp == 1 & is.na(women_wm))) %>%
tabyl(wm7, show_na = T)## [1] wm7 n percent
## <0 rows> (or 0-length row.names)
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
left_join(hh_dat, by = c("hh_id" = "hh_id")) %>%
filter(hh9 ==1) %>% # keeping only households that completed interview
mutate(
marital_men = structure(NA, label = "Marital Status of household member")
) -> merged_datWriting this object for safe keeping.
4.1.8 Step 1.8: Control Variables
Comments from the STATA code:
Households are identified as having ‘no eligible’ members if there are no applicable population, that is, children 0-5 years, adult women 15-49 years or adult men. These households will not have information on relevant indicators of health. As such, these households are considered as non-deprived in those relevant indicators.
As one must have noticed from previous sections, I prefer not to gererate variables that are grouped in the same data frame. However, in the following sections I will follow as the code in STATA do file. This is because the code seems to mutate the same data frame starting from section Step 1.6.
Having said this, I should explain why I am averse to this practice. Consider the following example.
tibble::tibble(
a = c(1,2,3,4,5,6),
b= rep(letters[1:3],2)
) -> data_example ## data for illustration
data_example## # A tibble: 6 x 2
## a b
## <dbl> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 a
## 5 5 b
## 6 6 c
Say, we want sum of column a for each type of observation in column b. We shall name this new variable c.
## # A tibble: 6 x 3
## # Groups: b [3]
## a b c
## <dbl> <chr> <dbl>
## 1 1 a 5
## 2 2 b 7
## 3 3 c 9
## 4 4 a 5
## 5 5 b 7
## 6 6 c 9
Above is the output we would generate if we wanted a sum of ‘a’ grouped by ‘b’. This might end up being misleading to someone who has neve seen the code but want to use the data. Some might do a simple average of cloumn c or try other summary statistics over this new column.
Moving on.
4.1.8.1 No Eligible Women 15-49 years
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women")
) %>%
tabyl(no_fem_eligible, show_na = T) %>%
gt()| no_fem_eligible | n | percent |
|---|---|---|
| FALSE | 33847 | 0.8594972 |
| TRUE | 5533 | 0.1405028 |
4.1.8.2 No Men Eligible
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man")
) %>%
tabyl(no_male_eligible, show_na = T) %>%
gt()| no_male_eligible | n | percent | valid_percent |
|---|---|---|---|
| NA | 39380 | 1 | NA |
4.1.8.3 NO Eligible Children 0-5 years
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible")
) %>%
tabyl(no_child_eligible) %>%
gt()| no_child_eligible | n | percent | valid_percent |
|---|---|---|---|
| NA | 39380 | 1 | NA |
4.1.8.4 No Eligible Women and Men
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible"),
no_adults_eligible = structure(NA,
label = "Household has no eligible women and men")
) %>%
tabyl(no_adults_eligible, show_na = T) %>%
gt()| no_adults_eligible | n | percent | valid_percent |
|---|---|---|---|
| NA | 39380 | 1 | NA |
4.1.8.5 NO Eligible Children and Women
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible"),
no_adults_eligible = structure(NA,
label = "Household has no eligible women and men"),
no_child_fem_eligible = structure(NA,
label = "Household has no children or women eligible")
) %>%
tabyl(no_child_fem_eligible, show_na = T) %>%
gt()| no_child_fem_eligible | n | percent | valid_percent |
|---|---|---|---|
| NA | 39380 | 1 | NA |
4.1.8.6 No Eligible Women, Men or Children
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible"),
no_adults_eligible = structure(NA,
label = "Household has no eligible women and men"),
no_child_fem_eligible = structure(NA,
label = "Household has no children or women eligible"),
no_eligibles = structure(NA,
label = "Household has no eligible women, men or children")
) %>%
tabyl(no_eligibles, show_na = T) %>%
gt()| no_eligibles | n | percent | valid_percent |
|---|---|---|---|
| NA | 39380 | 1 | NA |
4.1.8.7 No Eligible Subsample
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible"),
no_adults_eligible = structure(NA,
label = "Household has no eligible women and men"),
no_child_fem_eligible = structure(NA,
label = "Household has no children or women eligible"),
no_eligibles = structure(NA,
label = "Household has no eligible women, men or children"),
no_hem_eligible = structure(NA, label = "Household has no eligible individuals for hemoglobin measurement")
) %>%
select(-c("fem_eligible", "hh_n_fem_eligible"))## # A tibble: 39,380 x 217
## hh1.x hh2.x hl1 hl3 hl4 hl5m hl5y hl6 hl7 hl7b
## <dbl> <dbl> <dbl> <dbl+l> <dbl+l> <dbl+l> <dbl> <dbl> <dbl+l> <dbl+l>
## 1 1 1 1 1 [Hea~ 1 [Mal~ 5 1964 49 0 [Not~ 0 [Not~
## 2 1 1 2 2 [Wif~ 2 [Fem~ 6 1963 50 0 [Not~ 0 [Not~
## 3 1 1 3 3 [Son~ 1 [Mal~ 10 1998 15 0 [Not~ 0 [Not~
## 4 1 2 1 1 [Hea~ 2 [Fem~ 9 1957 56 0 [Not~ 0 [Not~
## 5 1 2 2 3 [Son~ 2 [Fem~ 12 1986 27 2 0 [Not~
## 6 1 2 3 4 [Son~ 1 [Mal~ 11 1983 30 0 [Not~ 0 [Not~
## 7 1 2 4 5 [Gra~ 1 [Mal~ 4 2011 2 0 [Not~ 4
## 8 1 2 5 5 [Gra~ 1 [Mal~ 5 2013 0 0 [Not~ 5
## 9 1 3 1 1 [Hea~ 2 [Fem~ 98 [DK] 1943 70 0 [Not~ 0 [Not~
## 10 1 3 2 3 [Son~ 2 [Fem~ 3 1978 35 2 0 [Not~
## # ... with 39,370 more rows, and 207 more variables: hl11 <dbl+lbl>,
## # hl12 <dbl+lbl>, hl12a <dbl+lbl>, hl13 <dbl+lbl>, hl14 <dbl+lbl>,
## # hl14a <dbl+lbl>, hl15 <dbl>, ed1 <dbl>, ed3 <dbl+lbl>, ed4a <dbl+lbl>,
## # ed4b <dbl+lbl>, ed5 <dbl+lbl>, ed6a <dbl+lbl>, ed6b <dbl+lbl>,
## # ed7 <dbl+lbl>, ed8a <dbl+lbl>, ed8b <dbl+lbl>, mp3 <dbl>, mp5 <dbl+lbl>,
## # mp6 <dbl+lbl>, mp7 <dbl+lbl>, hh5d.x <dbl>, hh5m.x <dbl>, hh5y.x <dbl>,
## # hh6.x <dbl+lbl>, hh7.x <dbl+lbl>, mline <dbl+lbl>, fline <dbl+lbl>,
## # ethnicity.x <dbl+lbl>, helevel.x <dbl+lbl>, melevel <dbl+lbl>,
## # felevel <dbl+lbl>, schage <dbl+lbl>, hhweight.x <dbl>, wscore.x <dbl>,
## # windex5.x <dbl+lbl>, wscoreu.x <dbl>, windex5u.x <dbl+lbl>,
## # wscorer.x <dbl>, windex5r.x <dbl+lbl>, hh_id <dbl>, ind_id <dbl>,
## # tot_child_died <int>, tot_child18_died_5y <int>, women_BH <dbl>,
## # wm7 <dbl+lbl>, cm1 <dbl+lbl>, cm8 <dbl+lbl>, cm9a <dbl>, cm9b <dbl>,
## # women_wm <dbl>, marital_wom <fct>, hh1.y <dbl>, hh2.y <dbl>, hh3 <dbl>,
## # hh4 <dbl>, hh5d.y <dbl>, hh5m.y <dbl>, hh5y.y <dbl>, hh6.y <dbl+lbl>,
## # hh7.y <dbl+lbl>, hh9 <dbl+lbl>, hh10 <dbl+lbl>, hh11 <dbl>, hh12 <dbl>,
## # hh13 <dbl>, hh14 <dbl>, hh15 <dbl>, hh16 <dbl>, hh17 <dbl>,
## # hh18h <dbl+lbl>, hh18m <dbl+lbl>, hh19h <dbl+lbl>, hh19m <dbl+lbl>,
## # sl1 <dbl>, sl9a <dbl>, sl9b <dbl+lbl>, sl9c <dbl>, cd_flag <dbl+lbl>,
## # cl2a <dbl+lbl>, cl2b <dbl+lbl>, cl2c <dbl+lbl>, cl2d <dbl+lbl>,
## # cl4 <dbl+lbl>, cl5 <dbl+lbl>, cl6 <dbl+lbl>, cl7a <dbl+lbl>,
## # cl7b <dbl+lbl>, cl7c <dbl+lbl>, cl7d <dbl+lbl>, cl7e <dbl+lbl>,
## # cl7f <dbl+lbl>, cl8 <dbl+lbl>, cl9 <dbl+lbl>, cl10a <dbl+lbl>,
## # cl10b <dbl+lbl>, cl10c <dbl+lbl>, cl10d <dbl+lbl>, cl10e <dbl+lbl>,
## # cl10f <dbl+lbl>, ...
4.1.9 Step 1.9: Renaming Demographic Variables
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible"),
no_adults_eligible = structure(NA,
label = "Household has no eligible women and men"),
no_child_fem_eligible = structure(NA,
label = "Household has no children or women eligible"),
no_eligibles = structure(NA,
label = "Household has no eligible women, men or children"),
no_hem_eligible = structure(NA, label = "Household has no eligible individuals for hemoglobin measurement")
) %>%
select(-c("fem_eligible", "hh_n_fem_eligible")) %>%
mutate(
weight = structure(hhweight.x, label = "Sample Weight"), # Sample Weight
area = structure(ifelse(hh6.x == 2,0,hh6.x),
label = "Area: urban-rural",
labels = c("rural","urban")), #Area: urban or rural||factors in
#R start with 1, to maintain #consistency with stata code #creating these integers as #numerics with levels
sex = structure(hl4,label = "Sex of Household member"), # Sex oh HH member
age = structure(ifelse(hl6>=98,NA,hl6),
label = "Age of household member"), #Age of HH member
agec7 =
structure(
factor(
case_when(
age >= 0 & age <= 4 ~ "0-4",
age >= 5 & age <= 9 ~ "5-9",
age >= 10 & age <= 14 ~ "10-14",
age >= 15 & age <= 17 ~ "15-17",
age >= 18 & age <= 59 ~ "18-59",
age >= 60 ~ "60+"
), levels = c("0-4","5-9","10-14","15-17","18-59","60+")
), label = "Age groups (7groups)"
), # Age group
agec4 =
structure(
factor(
case_when(
age >= 0 & age <= 9 ~ "0-9",
age >= 10 & age <= 17 ~ "10-17",
age >= 18 & age <= 58 ~ "18-59",
age >= 60 ~ "60+"
), levels = c("0-9","10-17","18-59","60+")
), label = "Age groups (4groups)"
), # Age group
member = 1
) %>%
group_by(hh_id) %>%
mutate(
hhsize = structure(sum(member), label = "Household size")
) %>%
ungroup() %>%
tabyl(hhsize, show_na = T) %>%
gt()| hhsize | n | percent |
|---|---|---|
| 1 | 613 | 0.0155662773 |
| 2 | 2872 | 0.0729304215 |
| 3 | 5892 | 0.1496190960 |
| 4 | 11296 | 0.2868461148 |
| 5 | 7860 | 0.1995937024 |
| 6 | 5220 | 0.1325545962 |
| 7 | 2485 | 0.0631030980 |
| 8 | 1288 | 0.0327069578 |
| 9 | 765 | 0.0194261046 |
| 10 | 550 | 0.0139664804 |
| 11 | 231 | 0.0058659218 |
| 12 | 96 | 0.0024377857 |
| 13 | 104 | 0.0026409345 |
| 14 | 14 | 0.0003555104 |
| 15 | 30 | 0.0007618080 |
| 16 | 64 | 0.0016251905 |
Note that for varible agec7, there are only 6 groups but the label in STATA code says 7 groups. Following the STATA code.
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible"),
no_adults_eligible = structure(NA,
label = "Household has no eligible women and men"),
no_child_fem_eligible = structure(NA,
label = "Household has no children or women eligible"),
no_eligibles = structure(NA,
label = "Household has no eligible women, men or children"),
no_hem_eligible = structure(NA, label = "Household has no eligible individuals for hemoglobin measurement")
) %>%
select(-c("fem_eligible", "hh_n_fem_eligible")) %>%
mutate(
weight = structure(hhweight.x, label = "Sample Weight"), # Sample Weight
area = structure(ifelse(hh6.x == 2,0,hh6.x),
label = "Area: urban-rural",
labels = c("rural","urban")), #Area: urban or rural||factors in
#R start with 1, to maintain #consistency with stata code #creating these integers as #numerics with levels
sex = structure(hl4,label = "Sex of Household member"), # Sex oh HH member
age = structure(ifelse(hl6>=98,NA,hl6),
label = "Age of household member"), #Age of HH member
agec7 =
structure(
factor(
case_when(
age >= 0 & age <= 4 ~ "0-4",
age >= 5 & age <= 9 ~ "5-9",
age >= 10 & age <= 14 ~ "10-14",
age >= 15 & age <= 17 ~ "15-17",
age >= 18 & age <= 59 ~ "18-59",
age >= 60 ~ "60+"
), levels = c("0-4","5-9","10-14","15-17","18-59","60+")
), label = "Age groups (7groups)"
), # Age group
agec4 =
structure(
factor(
case_when(
age >= 0 & age <= 9 ~ "0-9",
age >= 10 & age <= 17 ~ "10-17",
age >= 18 & age <= 58 ~ "18-59",
age >= 60 ~ "60+"
), levels = c("0-9","10-17","18-59","60+")
), label = "Age groups (4groups)"
), # Age group
member = 1
) %>%
group_by(hh_id) %>%
mutate(
hhsize = structure(sum(member), label = "Household size")
) %>%
ungroup() %>%
select(-c("member")) %>%
mutate(
region = structure(forcats::fct_relevel(
fct_recode(as_factor(hh7.x),
"North Central & Central Coastal Area" =
"North Central and Central Coastal area",
"Northern Midlands & Mountain Area" =
"Northern Midlands and Mountain area"),
c("Central Highlands",
"Mekong River Delta",
"North Central & Central Coastal Area",
"Northern Midlands & Mountain Area",
"Red River Delta",
"South East")),
label = "Region for Subnational Decomposition")
) %>%
tabyl(hh7.x, region) %>%
gt()| hh7.x | Central Highlands | Mekong River Delta | North Central & Central Coastal Area | Northern Midlands & Mountain Area | Red River Delta | South East |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 6198 | 0 |
| 2 | 0 | 0 | 0 | 6736 | 0 | 0 |
| 3 | 0 | 0 | 6134 | 0 | 0 | 0 |
| 4 | 7179 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 6748 |
| 6 | 0 | 6385 | 0 | 0 | 0 | 0 |
Notice how leveling of factors is significantly different in R. The {Forcats} is a very useful tool.
4.2 Step 2: Data preparation
Standardization of the 10 Global MPI Indicatiors.
Identification if non-deprived and deprived individuals.
4.2.1 Step 2.1: Years of Schooling
Comments from the STATA Code:
Note: In Viet Nam, children enter primary school aged 6 years, enter lower secondary school at 11 and upper secondary school at 15. There are grades in primary school (Grades 1 to 5), four in lower secondary school (Grades 6 to 9) and three in upper secondary school (Grades 10 to 12).(pg 192 report)
Let us first save the data frame created in the environment. This way we reduce the length of the code and copy pasting it again and again.
merged_dat %>%
mutate(
fem_eligible = ifelse(hl7 > 0 & !is.na(hl7), TRUE, NA), ## if hl7 is greater than 0, it cannot be NA, but I will stick to the STATA code for now
) %>%
group_by(hh_id) %>%
mutate(
hh_n_fem_eligible = sum(fem_eligible, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_fem_eligible = structure(hh_n_fem_eligible == 0, label = "Household has no eligible women"),
no_male_eligible = structure(NA, label = "Household has no eligible man"),
no_child_eligible = structure(NA, label = "Household has no children eligible"),
no_adults_eligible = structure(NA,
label = "Household has no eligible women and men"),
no_child_fem_eligible = structure(NA,
label = "Household has no children or women eligible"),
no_eligibles = structure(NA,
label = "Household has no eligible women, men or children"),
no_hem_eligible = structure(NA, label = "Household has no eligible individuals for hemoglobin measurement")
) %>%
select(-c("fem_eligible", "hh_n_fem_eligible")) %>%
mutate(
weight = structure(hhweight.x, label = "Sample Weight"), # Sample Weight
area = structure(ifelse(hh6.x == 2,0,hh6.x),
label = "Area: urban-rural",
labels = c("rural","urban")), #Area: urban or rural||factors in
#R start with 1, to maintain #consistency with stata code #creating these integers as #numerics with levels
sex = structure(hl4,label = "Sex of Household member"), # Sex oh HH member
age = structure(ifelse(hl6>=98,NA,hl6),
label = "Age of household member"), #Age of HH member
agec7 =
structure(
factor(
case_when(
age >= 0 & age <= 4 ~ "0-4",
age >= 5 & age <= 9 ~ "5-9",
age >= 10 & age <= 14 ~ "10-14",
age >= 15 & age <= 17 ~ "15-17",
age >= 18 & age <= 59 ~ "18-59",
age >= 60 ~ "60+"
), levels = c("0-4","5-9","10-14","15-17","18-59","60+")
), label = "Age groups (7groups)"
), # Age group
agec4 =
structure(
factor(
case_when(
age >= 0 & age <= 9 ~ "0-9",
age >= 10 & age <= 17 ~ "10-17",
age >= 18 & age <= 58 ~ "18-59",
age >= 60 ~ "60+"
), levels = c("0-9","10-17","18-59","60+")
), label = "Age groups (4groups)"
), # Age group
member = 1
) %>%
group_by(hh_id) %>%
mutate(
hhsize = structure(sum(member), label = "Household size")
) %>%
ungroup() %>%
select(-c("member")) %>%
mutate(
region = structure(forcats::fct_relevel(
fct_recode(as_factor(hh7.x),
"North Central & Central Coastal Area" =
"North Central and Central Coastal area",
"Northern Midlands & Mountain Area" =
"Northern Midlands and Mountain area"),
c("Central Highlands",
"Mekong River Delta",
"North Central & Central Coastal Area",
"Northern Midlands & Mountain Area",
"Red River Delta",
"South East")),
label = "Region for Subnational Decomposition")
) -> merged_datTable ed4b ed4a
| ed4b | 0 | 1 | 2 | 3 | 4 | 5 | 8 | NA_ |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 799 | 984 | 715 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1062 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1590 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1848 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1762 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 2145 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1802 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 2270 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 1741 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 5350 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 1225 | 0 | 0 | 0 | 0 |
| 11 | 0 | 0 | 0 | 1008 | 0 | 0 | 0 | 0 |
| 12 | 0 | 0 | 0 | 3533 | 0 | 0 | 0 | 0 |
| 97 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 |
| 98 | 0 | 2 | 2 | 4 | 0 | 0 | 8 | 0 |
| 99 | 0 | 2 | 1 | 4 | 0 | 0 | 8 | 0 |
| NA | 734 | 0 | 0 | 0 | 1553 | 3677 | 0 | 5547 |
Table age ed6a when ed5 is 1
| age | 0 | 1 | 2 | 3 | 4 | 5 | 9 |
|---|---|---|---|---|---|---|---|
| 5 | 623 | 5 | 0 | 0 | 0 | 0 | 1 |
| 6 | 71 | 624 | 0 | 0 | 0 | 0 | 0 |
| 7 | 3 | 658 | 1 | 0 | 0 | 0 | 0 |
| 8 | 1 | 664 | 0 | 0 | 0 | 0 | 3 |
| 9 | 0 | 661 | 0 | 0 | 0 | 0 | 1 |
| 10 | 0 | 664 | 5 | 0 | 0 | 0 | 1 |
| 11 | 1 | 115 | 481 | 0 | 0 | 0 | 0 |
| 12 | 0 | 32 | 550 | 0 | 0 | 0 | 1 |
| 13 | 0 | 8 | 634 | 0 | 0 | 0 | 0 |
| 14 | 0 | 8 | 527 | 4 | 0 | 0 | 0 |
| 15 | 0 | 2 | 128 | 316 | 1 | 0 | 0 |
| 16 | 0 | 1 | 24 | 457 | 1 | 0 | 1 |
| 17 | 0 | 1 | 8 | 462 | 3 | 1 | 0 |
| 18 | 0 | 0 | 1 | 102 | 22 | 133 | 0 |
| 19 | 0 | 0 | 0 | 24 | 32 | 145 | 1 |
| 20 | 0 | 0 | 1 | 5 | 42 | 152 | 0 |
| 21 | 0 | 0 | 0 | 3 | 15 | 134 | 0 |
| 22 | 0 | 0 | 0 | 1 | 7 | 102 | 0 |
| 23 | 0 | 0 | 0 | 0 | 14 | 58 | 0 |
| 24 | 0 | 0 | 0 | 0 | 4 | 26 | 1 |
Creating edulevel and table ed4a, ed3
merged_dat %>%
mutate(
edulevel = ifelse(ed4a >= 8, NA, ed4a),
edulevel = ifelse((is.na(ed3) | ed3 == 2),0,edulevel)
) %>%
tabyl(ed4a, ed3, show_na = T) %>%
gt() | ed4a | 1 | 2 | NA_ |
|---|---|---|---|
| 0 | 734 | 0 | 0 |
| 1 | 9210 | 0 | 0 |
| 2 | 12151 | 0 | 0 |
| 3 | 6491 | 0 | 0 |
| 4 | 1553 | 0 | 0 |
| 5 | 3677 | 0 | 0 |
| 8 | 17 | 0 | 0 |
| NA | 0 | 2201 | 3346 |
Creating eduhighyear and table ed4b, ed43
merged_dat %>%
mutate(
edulevel = ifelse(ed4a >= 8, NA, ed4a),
edulevel = ifelse((is.na(ed3) | ed3 == 2),0,edulevel),
eduhighyear = ifelse((is.na(ed4b)|ed4b == 97| ed4b == 98| ed4b == 99),
NA, ed4b),
eduhighyear = ifelse((is.na(ed3) | ed3 == 2),0,eduhighyear)
) %>%
tabyl(ed4b, ed3, show_na = T) %>%
gt()| ed4b | 1 | 2 | NA_ |
|---|---|---|---|
| 0 | 2499 | 0 | 0 |
| 1 | 1062 | 0 | 0 |
| 2 | 1590 | 0 | 0 |
| 3 | 1848 | 0 | 0 |
| 4 | 1762 | 0 | 0 |
| 5 | 2145 | 0 | 0 |
| 6 | 1802 | 0 | 0 |
| 7 | 2270 | 0 | 0 |
| 8 | 1741 | 0 | 0 |
| 9 | 5350 | 0 | 0 |
| 10 | 1225 | 0 | 0 |
| 11 | 1008 | 0 | 0 |
| 12 | 3533 | 0 | 0 |
| 97 | 3 | 0 | 0 |
| 98 | 16 | 0 | 0 |
| 99 | 15 | 0 | 0 |
| NA | 5964 | 2201 | 3346 |
Now, I feel it is as good a time as any, since we have moved this forward in to mutating new variables, to introduce the {magrittr} pipes. Up till now in the document you must have seen %>% appear everywhere. This is a pipe operator. What it does is to allow us to write code where the format is similar to f(g(y(x))), where f,g, and y are function, and x is input for y and y(x) is input of g and so on.
One must have noticed that if I am mutating multiple varibables of a data frame I ahve to reassign the mutated data frame with -> or <- operator. This can be redundant at times. Consider the following example.
let us say I want a new variable c = a*b in the same data frame ex_pipe and keep rows where b is 11. Below is code to do it with using pipe operator.
ex_pipe %>%
mutate(
c= a * b # creating new variable
) %>% # using pipe to feed the output of mutate in to filter function
filter(
b == 11
) -> ex_pipe # updating the ex_pipe object/ data frame by reassigning.
ex_pipe## # A tibble: 3 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 11 11
## 2 3 11 33
## 3 4 11 44
I might not want to use -> operator as it can get redundant. There is a cousin of the %>% operator that helps us tackle this problem. For instance, say i want ot add 1 to c in each row and update the data prame ex_pipe.
ex_pipe %<>% ## Notice the difference
mutate(
c = c+1
) ##notice no need for arrow opration and reassignment
ex_pipe## # A tibble: 3 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 11 12
## 2 3 11 34
## 3 4 11 45
We will be using %<>% where necessary.
Ok. Back on track now.
updating merged_dat.
merged_dat %<>%
mutate(
edulevel = ifelse(ed4a >= 8, NA, ed4a),
edulevel = ifelse((is.na(ed3) | ed3 == 2),0,edulevel),
eduhighyear = ifelse((is.na(ed4b)|ed4b == 97| ed4b == 98| ed4b == 99),
NA, ed4b),
eduhighyear = structure(ifelse((is.na(ed3) | ed3 == 2),0,eduhighyear),
label = "HIghest year of education completed")
) Table eduhighyear
| eduhighyear | n | percent | valid_percent |
|---|---|---|---|
| 0 | 8046 | 0.20431691 | 0.24102810 |
| 1 | 1062 | 0.02696800 | 0.03181355 |
| 2 | 1590 | 0.04037583 | 0.04763046 |
| 3 | 1848 | 0.04692737 | 0.05535918 |
| 4 | 1762 | 0.04474352 | 0.05278294 |
| 5 | 2145 | 0.05446927 | 0.06425619 |
| 6 | 1802 | 0.04575927 | 0.05398119 |
| 7 | 2270 | 0.05764347 | 0.06800072 |
| 8 | 1741 | 0.04421026 | 0.05215386 |
| 9 | 5350 | 0.13585576 | 0.16026601 |
| 10 | 1225 | 0.03110716 | 0.03669642 |
| 11 | 1008 | 0.02559675 | 0.03019591 |
| 12 | 3533 | 0.08971559 | 0.10583548 |
| NA | 5998 | 0.15231082 | NA |
Cleaning inconsistencies.
merged_dat %<>%
mutate(
eduhighyear = ifelse(age < 10, 0, eduhighyear ),
eduhighyear = ifelse((edulevel == 1 & eduhighyear > 5), NA, eduhighyear ),
eduhighyear = ifelse((edulevel == 2 & eduhighyear > 9), NA, eduhighyear ),
eduhighyear = ifelse((edulevel == 3 & eduhighyear > 12), NA, eduhighyear ),
eduhighyear = ifelse(edulevel == 0, 0, eduhighyear ),
)Creating Years of schooling.
merged_dat %<>%
mutate(
eduyears = ifelse((edulevel == 1 & is.na(eduhighyear)),0, eduhighyear),
eduyears = ifelse((edulevel == 2 & (is.na(eduhighyear)|eduhighyear ==0)),
5,eduyears),
eduyears = ifelse((edulevel == 3 & (is.na(eduhighyear)|eduhighyear ==0)),
9,eduyears),
eduyears = ifelse((edulevel == 4 & (is.na(eduhighyear)|eduhighyear ==0)),
12,eduyears), # the equivalent STATA code fot this is written twice in the do file.
eduyears = ifelse(((edulevel == 4 | edulevel ==5) &
(is.na(eduhighyear)|eduhighyear ==0)),
13,eduyears),
eduyears = ifelse(edulevel == 0 ,0, eduyears),
eduyears = ifelse(is.na(edulevel) ,NA, eduyears),
)Checking for further inconsistencies.
merged_dat %<>%
mutate(
eduyears = ifelse((age <= eduyears & age > 0), NA, eduyears),
eduyears = structure(ifelse(age < 10, 0, eduyears),
label = "Total number of years of education acomplished")
)
# Table eduyears edulevel
merged_dat %>%
tabyl(eduyears, edulevel, show_na = T) %>%
kableExtra::kable()| eduyears | 0 | 1 | 2 | 3 | 4 | 5 | NA_ |
|---|---|---|---|---|---|---|---|
| 0 | 6281 | 2707 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 356 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 951 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1298 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1753 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 2145 | 987 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1802 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 2270 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 1741 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 5350 | 725 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 1225 | 0 | 0 | 0 |
| 11 | 0 | 0 | 0 | 1008 | 0 | 0 | 0 |
| 12 | 0 | 0 | 0 | 3533 | 0 | 0 | 0 |
| 13 | 0 | 0 | 0 | 0 | 1553 | 3677 | 0 |
| NA | 0 | 0 | 0 | 0 | 0 | 0 | 17 |
Table ed4b age
| ed4b | 7 |
|---|---|
| 0 | 1 |
Creating Control variables regarding information on years of education.
merged_dat %<>%
mutate(
temp = ifelse((!is.na(eduyears) & age >= 10 & !is.na(age)), 1, NA)
) %>%
group_by(hh_id) %>%
mutate(
no_missing_edu = sum(temp, na.rm = T)
) %>%
ungroup() %>%
mutate(
temp2 = ifelse((age >= 10 & !is.na(age)),1,NA)
) %>%
group_by(hh_id) %>%
mutate(
hhs = sum(temp2, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_missing_edu = no_missing_edu/hhs,
no_missing_edu = structure(no_missing_edu >= (2/3),
label = "No missing edu for at least 2/3 of the HH members aged 10 years & older")
)
# Table for no_missing_edu
merged_dat %>%
tabyl(no_missing_edu, show_na = T) %>%
gt()| no_missing_edu | n | percent |
|---|---|---|
| FALSE | 8 | 0.0002031488 |
| TRUE | 39372 | 0.9997968512 |
Dropping temp, temp2 and hhs
4.2.1.1 Standard MPI
Comments from STATA code:
The entire household is considered deprived if no household member aged 10 years or older has completed SIX years of schooling.
merged_dat %<>%
mutate(
years_edu6 = ifelse(is.na(eduyears),NA,(eduyears >= 6)),
) %>%
group_by(hh_id) %>%
mutate(
hh_years_edu6_1 = max(years_edu6, na.rm = T)
) %>%
ungroup() %>%
mutate(
hh_years_edu6 = (hh_years_edu6_1 == 1),
hh_years_edu6 = ifelse(is.na(hh_years_edu6_1), NA, hh_years_edu6),
hh_years_edu6 = structure(
ifelse((hh_years_edu6 == 0 & no_missing_edu == 0),
NA, hh_years_edu6),
label = "Household has at least one member with 6 years of edu")
)
# Table hh_years_edu6
merged_dat %>%
tabyl(hh_years_edu6, show_na = T) %>%
gt()| hh_years_edu6 | n | percent |
|---|---|---|
| FALSE | 3439 | 0.08732859 |
| TRUE | 35941 | 0.91267141 |
4.2.2 Step 2.2: Child School Attendance
Creating the attendence variable.
merged_dat %<>%
mutate(
attendance = ifelse(ed5 == 1, 1, NA),
attendance = ifelse(ed5 == 2, 0, attendance),
attendance = ifelse(ed3 == 2, 0, attendance),
)
# Table age ed5
merged_dat %>%
tabyl(age, ed5, show_na = T) %>%
gt()| age | 1 | 2 | 9 | NA_ |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 732 |
| 1 | 0 | 0 | 0 | 787 |
| 2 | 0 | 0 | 0 | 634 |
| 3 | 0 | 0 | 0 | 553 |
| 4 | 0 | 0 | 0 | 640 |
| 5 | 629 | 4 | 1 | 34 |
| 6 | 695 | 3 | 0 | 12 |
| 7 | 662 | 1 | 0 | 9 |
| 8 | 668 | 6 | 0 | 9 |
| 9 | 662 | 3 | 0 | 6 |
| 10 | 670 | 5 | 0 | 9 |
| 11 | 597 | 12 | 0 | 9 |
| 12 | 583 | 14 | 0 | 7 |
| 13 | 642 | 45 | 0 | 11 |
| 14 | 539 | 61 | 0 | 14 |
| 15 | 447 | 103 | 1 | 7 |
| 16 | 484 | 159 | 0 | 11 |
| 17 | 475 | 202 | 3 | 18 |
| 18 | 258 | 327 | 4 | 9 |
| 19 | 202 | 339 | 3 | 18 |
| 20 | 200 | 363 | 2 | 18 |
| 21 | 152 | 369 | 7 | 21 |
| 22 | 110 | 460 | 5 | 20 |
| 23 | 72 | 540 | 13 | 22 |
| 24 | 31 | 518 | 17 | 20 |
| 25 | 0 | 0 | 0 | 595 |
| 26 | 0 | 0 | 0 | 605 |
| 27 | 0 | 0 | 0 | 528 |
| 28 | 0 | 0 | 0 | 618 |
| 29 | 0 | 0 | 0 | 604 |
| 30 | 0 | 0 | 0 | 651 |
| 31 | 0 | 0 | 0 | 634 |
| 32 | 0 | 0 | 0 | 575 |
| 33 | 0 | 0 | 0 | 569 |
| 34 | 0 | 0 | 0 | 574 |
| 35 | 0 | 0 | 0 | 540 |
| 36 | 0 | 0 | 0 | 574 |
| 37 | 0 | 0 | 0 | 600 |
| 38 | 0 | 0 | 0 | 550 |
| 39 | 0 | 0 | 0 | 580 |
| 40 | 0 | 0 | 0 | 552 |
| 41 | 0 | 0 | 0 | 583 |
| 42 | 0 | 0 | 0 | 576 |
| 43 | 0 | 0 | 0 | 570 |
| 44 | 0 | 0 | 0 | 579 |
| 45 | 0 | 0 | 0 | 612 |
| 46 | 0 | 0 | 0 | 498 |
| 47 | 0 | 0 | 0 | 514 |
| 48 | 0 | 0 | 0 | 542 |
| 49 | 0 | 0 | 0 | 571 |
| 50 | 0 | 0 | 0 | 567 |
| 51 | 0 | 0 | 0 | 566 |
| 52 | 0 | 0 | 0 | 449 |
| 53 | 0 | 0 | 0 | 561 |
| 54 | 0 | 0 | 0 | 508 |
| 55 | 0 | 0 | 0 | 485 |
| 56 | 0 | 0 | 0 | 441 |
| 57 | 0 | 0 | 0 | 383 |
| 58 | 0 | 0 | 0 | 341 |
| 59 | 0 | 0 | 0 | 370 |
| 60 | 0 | 0 | 0 | 298 |
| 61 | 0 | 0 | 0 | 321 |
| 62 | 0 | 0 | 0 | 222 |
| 63 | 0 | 0 | 0 | 272 |
| 64 | 0 | 0 | 0 | 237 |
| 65 | 0 | 0 | 0 | 225 |
| 66 | 0 | 0 | 0 | 167 |
| 67 | 0 | 0 | 0 | 172 |
| 68 | 0 | 0 | 0 | 132 |
| 69 | 0 | 0 | 0 | 143 |
| 70 | 0 | 0 | 0 | 150 |
| 71 | 0 | 0 | 0 | 155 |
| 72 | 0 | 0 | 0 | 123 |
| 73 | 0 | 0 | 0 | 154 |
| 74 | 0 | 0 | 0 | 125 |
| 75 | 0 | 0 | 0 | 131 |
| 76 | 0 | 0 | 0 | 125 |
| 77 | 0 | 0 | 0 | 105 |
| 78 | 0 | 0 | 0 | 110 |
| 79 | 0 | 0 | 0 | 92 |
| 80 | 0 | 0 | 0 | 110 |
| 81 | 0 | 0 | 0 | 104 |
| 82 | 0 | 0 | 0 | 78 |
| 83 | 0 | 0 | 0 | 113 |
| 84 | 0 | 0 | 0 | 105 |
| 85 | 0 | 0 | 0 | 62 |
| 86 | 0 | 0 | 0 | 51 |
| 87 | 0 | 0 | 0 | 47 |
| 88 | 0 | 0 | 0 | 32 |
| 89 | 0 | 0 | 0 | 22 |
| 90 | 0 | 0 | 0 | 27 |
| 91 | 0 | 0 | 0 | 21 |
| 92 | 0 | 0 | 0 | 23 |
| 93 | 0 | 0 | 0 | 20 |
| 94 | 0 | 0 | 0 | 14 |
| 95 | 0 | 0 | 0 | 29 |
merged_dat %<>%
mutate(
attendance = ifelse((age < 5 | age > 24),0,attendance)
)
# Table Attendance
merged_dat %>%
tabyl(attendance, show_na = T) %>%
gt()| attendance | n | percent | valid_percent |
|---|---|---|---|
| 0 | 30546 | 0.775672930 | 0.7767775 |
| 1 | 8778 | 0.222905028 | 0.2232225 |
| NA | 56 | 0.001422042 | NA |
4.2.2.1 Standard MPI
Comments from STATA code:
The entire household is considered deprived if any school-aged child is not attending school up to class 8.
merged_dat %<>%
mutate(
child_schoolage = (age >= 6 & age <= 14)
)
# count If child_schoolage ==1 & attendance ==.
merged_dat %>%
filter(child_schoolage == 1 & is.na(attendance))## # A tibble: 0 x 234
## # ... with 234 variables: hh1.x <dbl>, hh2.x <dbl>, hl1 <dbl>, hl3 <dbl+lbl>,
## # hl4 <dbl+lbl>, hl5m <dbl+lbl>, hl5y <dbl+lbl>, hl6 <dbl+lbl>,
## # hl7 <dbl+lbl>, hl7b <dbl+lbl>, hl11 <dbl+lbl>, hl12 <dbl+lbl>,
## # hl12a <dbl+lbl>, hl13 <dbl+lbl>, hl14 <dbl+lbl>, hl14a <dbl+lbl>,
## # hl15 <dbl>, ed1 <dbl>, ed3 <dbl+lbl>, ed4a <dbl+lbl>, ed4b <dbl+lbl>,
## # ed5 <dbl+lbl>, ed6a <dbl+lbl>, ed6b <dbl+lbl>, ed7 <dbl+lbl>,
## # ed8a <dbl+lbl>, ed8b <dbl+lbl>, mp3 <dbl>, mp5 <dbl+lbl>, mp6 <dbl+lbl>,
## # mp7 <dbl+lbl>, hh5d.x <dbl>, hh5m.x <dbl>, hh5y.x <dbl>, hh6.x <dbl+lbl>,
## # hh7.x <dbl+lbl>, mline <dbl+lbl>, fline <dbl+lbl>, ethnicity.x <dbl+lbl>,
## # helevel.x <dbl+lbl>, melevel <dbl+lbl>, felevel <dbl+lbl>,
## # schage <dbl+lbl>, hhweight.x <dbl>, wscore.x <dbl>, windex5.x <dbl+lbl>,
## # wscoreu.x <dbl>, windex5u.x <dbl+lbl>, wscorer.x <dbl>,
## # windex5r.x <dbl+lbl>, hh_id <dbl>, ind_id <dbl>, tot_child_died <int>,
## # tot_child18_died_5y <int>, women_BH <dbl>, wm7 <dbl+lbl>, cm1 <dbl+lbl>,
## # cm8 <dbl+lbl>, cm9a <dbl>, cm9b <dbl>, women_wm <dbl>, marital_wom <fct>,
## # hh1.y <dbl>, hh2.y <dbl>, hh3 <dbl>, hh4 <dbl>, hh5d.y <dbl>, hh5m.y <dbl>,
## # hh5y.y <dbl>, hh6.y <dbl+lbl>, hh7.y <dbl+lbl>, hh9 <dbl+lbl>,
## # hh10 <dbl+lbl>, hh11 <dbl>, hh12 <dbl>, hh13 <dbl>, hh14 <dbl>, hh15 <dbl>,
## # hh16 <dbl>, hh17 <dbl>, hh18h <dbl+lbl>, hh18m <dbl+lbl>, hh19h <dbl+lbl>,
## # hh19m <dbl+lbl>, sl1 <dbl>, sl9a <dbl>, sl9b <dbl+lbl>, sl9c <dbl>,
## # cd_flag <dbl+lbl>, cl2a <dbl+lbl>, cl2b <dbl+lbl>, cl2c <dbl+lbl>,
## # cl2d <dbl+lbl>, cl4 <dbl+lbl>, cl5 <dbl+lbl>, cl6 <dbl+lbl>,
## # cl7a <dbl+lbl>, cl7b <dbl+lbl>, cl7c <dbl+lbl>, cl7d <dbl+lbl>, ...
merged_dat %<>%
mutate(
temp = ifelse((child_schoolage == 1 & (!is.na(attendance))),
1, NA)
) %>%
group_by(hh_id) %>%
mutate(
no_missing_atten = sum(temp, na.rm = T)
) %>%
ungroup() %>%
mutate(
temp2 = ifelse(child_schoolage == 1, 1, NA)
) %>%
group_by(hh_id) %>%
mutate(
hhs = sum(temp2, na.rm = T)
) %>%
ungroup() %>%
mutate(
no_missing_atten = ifelse((no_missing_atten == 0 & hhs == 0),
1,no_missing_atten/hhs), # deviation
no_missing_atten = structure((no_missing_atten >= (2/3)),
label = "No missing school attendance for at least 2/3 of the school aged children")
) %>%
select(-c("temp", "temp2", "hhs"))
# Table no_mossing_atten
merged_dat %>%
tabyl(no_missing_atten, show_na = T) %>%
gt()| no_missing_atten | n | percent |
|---|---|---|
| TRUE | 39380 | 1 |
Notice there is a deviation from STATA code : replace no_missing_aten = no_missing_atten/hhs. This is because when both variables are 0 NaNs are generated.
merged_dat %<>%
group_by(hh_id) %>%
mutate(
hh_children_schoolage = sum(child_schoolage, na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
hh_children_schoolage = structure(
hh_children_schoolage > 0,
label = "Household has children in schoolage"
),
child_not_atten = ifelse(child_schoolage == 1, attendance == 0, NA),
child_not_atten = ifelse((is.na(attendance) & child_schoolage == 1),
NA, child_not_atten)
) %>%
group_by(hh_id) %>%
mutate(
any_child_not_atten = max(child_not_atten, na.rm = T)
) %>%
ungroup() %>%
mutate(
hh_child_atten = (any_child_not_atten == 0),
hh_child_atten = ifelse(is.na(any_child_not_atten),
NA, hh_child_atten),
hh_child_atten = ifelse(hh_children_schoolage ==0,
1, hh_child_atten),
hh_child_atten = structure(
ifelse((hh_child_atten == 1 & no_missing_atten == 0),
NA, hh_child_atten),
label = "Household has all school age children up to class 8 in school"
)
)
# Table hh_child_Atten
merged_dat %>%
tabyl(hh_child_atten, show_na = T) %>%
gt()| hh_child_atten | n | percent |
|---|---|---|
| 0 | 1186 | 0.03011681 |
| 1 | 38194 | 0.96988319 |